<!-- Copyright 2017 Yahoo Holdings. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->
<title>Juniper Configuration Documentation</title>
<h1>Juniper Configuration Documentation</h1>

<b>Note:</b> This document describes in details the functionality of Juniper v.2.1.0.
The document has gradually become more and more for internal use for
instance for detailed tuning
by Professional Service. A more high level and less detailed user level
configuration documentation is also available.
<p>
Juniper implements a combined proximity ranking and dynamic teaser result
processing module.This module is intended to be interfaced to by different
Fast software modules on demand. Currently, the only available module that
makes use of Juniper is the Fast Server module, in which Juniper currently
is an integrated part of <dfn>fsearch</dfn> (the search engine executable
that runs on each search node in the system).

<h2>Juniper simple description of functionality/implementation</h2>

The document body is stripped for markup during document processing and
stored as an extended document summary field. 
A max limit of how much of the document that gets stored is configurable as
of Fast Server v.4.17 (see the Fast Server�configuration documentation for
details).  For each document
on the result page, this document extract is retrieved and fed through
Juniper which will perform the following steps:

<ol>
  <li> Scan the stripped document text (docsum) for matches of the query,
  create a data structure containing information about those matches,
  and provide a quality measure (rank boost value) that can be used as a
  metric to determine the quality of the document wrt. proximity and
  position of the search string in the document. The data structure
  contains ao. a list of matches of the query ordered by quality (see below
  for the quality measure). The document quality measure is computed from the
  quality measure of the best of the individual matches and the total
  number of hits within the document.

  <li> Generate a dynamic teaser based on the data structure previously
  generated. The dynamic teaser is composed of a number of text segments
  that include the "best" matches of the query in that
  document. The teaser is presented with the query words highlighted.
  The definition of highlight is configurable. If the document is short
  enough to fit completely into the configured teaser length, it will be
  provided as is, but with highlighting of the relevant keywords.
</ol>

Step 2 is only necessary if the teaser is going to be displayed, which
might be a decision taken on basis of the quality measure provided in step 1.

<h3>Quality measure</h3>
The text segments matching the query are ranked by (in decreasing order of
significance):

<ol>
  <li> Completeness * keyword weight - 
       higher ranking if more search words are present in
       the same context, and relatively higher weight on matches that
       contains "important" terms compared to matches with stop words if
       equal number of words.
  <li> Proximity - query terms occurring near each other is better
  <li> Position - earlier in document is better
</ol>

The number of matches selected is based on text segment lengths including
a configurable amound of surrounding text, the number of matching segments
to use (configurable) and the required total summary
size (configurable). The final set of matches is 
returned with markup for the hits and the abbreviated sections
(continuation symbol).

The query used for teaser generation has undergone proper name recognition
and English spell checking. Highlighting is done on individual terms of the
query. In particular, phrases are broken down into individual terms, but
the preference to proximal terms will maintain the phrasing in the
generated teaser.  

Lemmatization by expanding documents with word inflections cannot be used
by Juniper. In the future, Juniper would expand the query based on the
original query and language information. This functionality is not
available yet, thus lemmatized terms will in general not be highlighted by
Juniper.

Currently Juniper uses an alternative, simple brute force stemming
algorithm that basically allows prefixes of the document words to match if
the document word in question is no longer than P (configurable) bytes
<i>longer</i> than the query keyword. 
This algorithm works well for keywords of a certain size, but not for very
short keywords. Thus an additional configuration variable defines a lower
bound for what lengths of keywords that will be subject to this algorithm.
With this simple algorithm in place, typical weak form singular to plural
mapping will get highlighted while the opposite, going from a long form to
a shorter one will not work as might be expected. Eg. this algorithm does
not change the keywords themselves. Consequently, the shorter forms of the
keywords are more likely to give non-exact hits in the dynamic summary.


<h2>Fast Search configuration of Juniper</h2>
Enabling Juniper functionality within Fast Search is done on a per field
basis by means of override specifications in summary.map. 
Currently the following override specs are supported by Juniper:
<p>
<ul>
<li><pre>override &lt;outputfield&gt; dynamicteaser &lt;inputfield&gt;</pre>
<li><pre>override &lt;outputfield&gt; dynamicteasermetric
&lt;inputfield&gt;</pre>
<li><pre>override &lt;outputfield&gt; juniperlog &lt;inputfield&gt;</pre>
</ul>
<p>
Details of the override directive can be found in <i>Fast Search 4.13 -
Dynamic Docsum Generation Framework</i>.
The <tt>dynamicteasermetric</tt> field provides a ranking of the document
based on a corresponding metric as that used to select between individual
matches for dynamic teaser generation inside a document. See the section on
using Juniper for proximity boosting <a href="#proximity">below</a>
The <tt>juniperlog</tt> field is new as of Juniper 2.0, and is used to
retrieve the information generated by Juniper by means of the log query
option, see the runtime option table <a href="#dynpar">below</a>.

Note that when integrated into Data Search 3.1 and later, 
this part of the configuration will be generated via
<ol>
<li>The index profile
<li>The index configuration (indexConfig.xml) by the config server.
</ol>

<h3>Configuration levels</h3>
When integrated into Fast Search, Juniper receives its default parameters from
global settings in the <dfn>fsearchrc</dfn> config file. 
These configuration parameter settings must be preconfigured at
<dfn>fsearch</dfn> process startup time. Two levels of system configuration
is currently supported, 
<ol>
<li><b>System default configuration:</b> This is the configuration settings
exemplified by the parameter descriptions below.
<li><b>Per field configuration:</b>By using the field names instead of the
string "juniper" as prefix, the default setting can be overridden on a
<dfn>result field</dfn> basis. Eg. setting for instance
<pre>myfield.dynsum.length 512</pre> (see below) would allow the
<tt>myfield</tt> result field to receive a different teaser length.
</ol>
In addition Juniper 2.x supports changing certain subset of the parameters
on a per query basis. See <a href="#dynpar">separate section</a> on this below.
<p>
<b>Performance note:</b> The per field configuration possibility should be used with
care since overriding some parameters may cause significant computation
overhead in that Juniper would have to scan the whole text multiple times.
Changing the <tt>dynsum</tt> group of fields is generally quite performance
conservative (only the teaser generation phase would have to be repeated), 
while changing any of the <tt>stem</tt> or <tt>matcher</tt>
fields would require a different text scan for each combination of
parameters.

<h3>Arbitrary byte sequences in markup parameters</h3>
To allow arbitrary byte seqences (such as low ascii values) to be used to
denote highlight on/off and continuation symbol(s), Juniper now accepts
strings on the form \xNN where the N's are hex values [0-9a-fA-F].
This will be converted into a byte value of NN. Note that Juniper exports
UTF-8 text so this sequence should be a valid UTF-8 byte sequence. No
checks are performed from Juniper on the validity of such strings in the
<dfn>fsearch</dfn> domain.
As a consequence of this, occurrences of backslash must be escaped
accordingly (<dfn>\\</dfn>).


<h3>Blanks in text parameters must be escaped</h3>
Note that <dfn>fsearchrc</dfn> does not accept blanks in the
parameters. To allow more complicated highlight markup, the sequence
<dfn>\x20</dfn> must be used as space in text fields. 

<p>
<h3><a name="em"></a>Escaping markup in the summary text</h3>
Since Juniper may supply markup through the use of the
highlight and continuation parameters, problems may occur if the
analysed text itself contains markup. To avoid this, Juniper may be
configured to escape the 5 XML/HTML markup symbols
(<dfn>&quot;&amp;&lt;&apos;&gt;</dfn>) before adding the mentioned
parametrized symbols. See the description of the
juniper.dynsum.escape_markup parameter <a href="#empar">below</a>.
<p>
The following variables are available for static, global configuration for
a particular search node:
<p>

<table><a name="conftable"></a>
  <tr bgcolor="#f0f0f0"><td><b>Parameter name</b></td><td><b>Default
  value</b></td><td><b>Description</b></td>
  </tr>

  <tr bgcolor="#f0f0f0">
    <td>juniper.dynsum.highlight_on</td><td>&lt;b&gt;</td>
    <td>A string to be included <i>before</i> each hit in the generated summary</td>
  <tr bgcolor="#f0f0f0">
    <td>juniper.dynsum.highlight_off</td>
    <td>&lt;/b&gt;</td><td>A string to be included <i>after</i> each hit in
        the generated summary</td>
  <tr bgcolor="#f0f0f0">
    <td>juniper.dynsum.continuation</td>
    <td>...</td><td>A string to be included to denote
        abbreviated/left out pieces of the original text in the generated summary</td>
  <tr bgcolor="#f0f0f0">
    <td>juniper.dynsum.separators</td>
    <td>\x1D\x1F</td><td>A string containing characters that are added for
    word separation purposes (eg.CJK languages and German/Norwegian
    etc. word separation). This list should contain non-word characters
    only for this to be meaningful. Also, currently only single byte
    characters are supported. These characters wil be removed from the
    generated teaser by Juniper.</td>
  <tr bgcolor="#f0f0f0">
    <td>juniper.dynsum.connectors</td>
    <td>-'</td><td>A string containing characters that may connect two word
    tokens to form a single word. Words connected by a single such
    character will not be splitted by Juniper when generating the teaser.</td>
  <tr bgcolor="#f0f0f0">
    <td><a name="empar"></a>juniper.dynsum.escape_markup</td>
    <td>auto</td><td>See <a href="#em">description</a> above. Accepted values:
      <dfn>on</dfn>,<dfn>off</dfn> or <dfn>auto</dfn>. If <dfn>auto</dfn> is
      used, Juniper will escape markup in the generated summary if any of the symbols
      <dfn>highlight_on</dfn>, <dfn>highlight_off</dfn> or
      <dfn>continuation</dfn> contains a <dfn>&lt;</dfn> as the first
      character.
    </td>
  <tr bgcolor="#f0f0f0">
    <td>juniper.dynsum.length</td>
    <td>256</td><td>Length of the generated summary in bytes. This is a
    hint to Juniper. The result may be slightly longer or shorter depending
    on the structure of the available document text and the submitted
    query.</td>
  <tr bgcolor="#f0f0f0">
    <td>juniper.dynsum.max_matches</td>
    <td>4</td><td>The number of (possibly partial) set of keywords
    matching the query, to attempt to include in the summary. The larger this
    value compared is set relative to the <i>length</i> parameter, the more
    dense the keywords may appear in the summary.</td>
  <tr bgcolor="#f0f0f0">
    <td>juniper.dynsum.min_length</td>
    <td>128</td><td>Minimal desired length of the generated summary in
    bytes. This is the shortest summary length for which the number of
    matches will be respected. Eg. if
    a summary appear to become shorter than <i>min_length</i> bytes with
    <i>max_matches</i> matches, then additional matches will be used if available.</td>
  <tr bgcolor="#f0f0f0">
    <td>juniper.dynsum.surround_max</td>
    <td>80</td><td>The maximal number of bytes of context to prepend and append to
    each of the selected query keyword hits. This parameter defines the
    max size a summary would become if there are few keyword hits
    (max_matches set low or document contained few matches of the
    keywords.</td>
  <tr bgcolor="#f0f0f0">
    <td>juniper.stem.min_length</td>
    <td>5</td><td>The minimal number of bytes in a query keyword for
    it to be subject to the simple Juniper stemming algorithm. Keywords
    that are shorter than or equal to this limit will only yield exact
    matches in the dynamic summaries.
    </td>
  <tr bgcolor="#f0f0f0">
    <td>juniper.stem.max_extend</td>
    <td>3</td><td>The maximal number of bytes that a word in the document
    can be <i>longer</i> than the keyword itself to yield a match. Eg. for
    the default values, if the keyword is 7 bytes long, it will match any
    word with length less than or equal to 10 for which the keyword is a prefix.
    </td>
  </tr>
  <tr bgcolor="#f0f0f0">
    <td>juniper.matcher.winsize</td>
    <td>400</td><td>The size of the sliding window used to determine if
    multiple query terms occur together. The larger the value, the more
    likely the system will find (and present in dynamic summary) complete
    matches containing all the search terms. The downside is a potential
    performance overhead of keeping candidates for matches longer during
    matching, and consequently updating more candidates that eventually
    gets thrown.
    </td>
  </tr>
  <tr bgcolor="#f0f0f0">
    <td>juniper.proximity.factor</td><td>0.25</td><td>
A factor to multiply the internal Juniper metric with when producing
proximity metric for a given field. A real/floating point value accepted
Note that the QRserver (see <a href="#qrserver">below</a>) 
also supports a factor that is global to all proximity
metric fields, and that is applied in addition.	</td>
  </tr>

</table>

<h2><a name="dynpar"></a>Alternate behaviour of Juniper on a per query
basis</h2>
As of Juniper v.2.x and Fastserver v.4.17, and QRserver for Data Search
3.2, Juniper supports a number of Juniper specific options that can be
provided as part of the URL. The format of the option string is 
<pre>
  juniper=&lt;param_name&gt;.&lt;value&gt;[_&lt;param_name&gt;.&lt;value&gt;]*
</pre>
As an example, consider the following URL addition:
<pre>
  juniper=near.2_dynlength.512_dynmatches.8
</pre>
If this string is present in the URL, Juniper would generate teasers that
are up to twice as long and contains up to twice as many matches of the
query compared to the default values. 
In addition, teasers (and <a href="#proximity">proximity metric</a>) will
only be
generated for those documents that fulfills the extra constraint that there
exist at least one complete match of the query where the distance in words
between the first and the last word of the query match is no more than 2 (+
the number of words in the query).

<h3>Supported per query options in Juniper 2.1.0</h3>

<table>
  <tr bgcolor="#f0f0f0"><td><b>Parameter
  name</b></td><td><b>Corresponding config name, see <a href="#conftable">above</a></b></td><td><b>Description</b></td></tr>
  <tr bgcolor="#f0f0f0"><td>dynlength</td><td>dynsum.length</td>
  <td>The desired max length of the generated teaser</td></tr>
  <tr bgcolor="#f0f0f0"><td>dynmatches</td><td>dynsum.max_matches</td>
  <td>The number of matches to try to fit in the teaser</td>
  </tr>
  <tr bgcolor="#f0f0f0"><td>dynsurmax</td><td>dynsum.surround_max</td>
  <td>The maximal amount of surrounding context per keyword hit</td>
  </tr>
  <tr bgcolor="#f0f0f0"><td>near</td><td><i>N/A</i></td><td>Specifies a
  proximity search where keywords should occur closer than the specified
  value in number of words not counting the query terms themselves.</td>
  </tr>
  <tr bgcolor="#f0f0f0"><td>stemext</td><td>juniper.stem.max_extend</td><td>
  The maximal number of bytes that a word in the document can be longer
  than the keyword itself to yield a match.</td>
  </tr>
  <tr bgcolor="#f0f0f0"><td>stemmin</td><td>juniper.stem.min_length</td><td>
  The minimal number of bytes in a query keyword for it to be subject to
  the simple Juniper stemming algorithm.</td> 
  </tr>
  <tr bgcolor="#f0f0f0"><td>within</td><td><i>N/A</i></td><td>Same as
  <i>near</i> with the additional constraint that matches of the query must
  have the same order of the query words as the original query.</td>
  </tr>
  <tr bgcolor="#f0f0f0"><td>winsize</td><td>juniper.matcher.winsize</td><td>
  The size of the sliding window used to determine if multiple query terms
  occur together.</td>
  </tr>
  <tr bgcolor="#f0f0f0"><td>log</td><td><i>N/A</i></td><td>Internal debug
  option (privileged port only). Value is a bitmap that allows selectively
  enabled log output to be generated by Juniper for output into a
  juniperlog override configured summary field. Useful only with a special
  template that makes use of this information. Currently the only
  supported bit is 0x8000 which will provide a html table with up to 20 of the
  topmost matches of each document, and their identified proximity
  (distance) and rank.
</td>
</table>

<h3>Juniper debug template in Data Search</h3>
Template support for the log parameter as well as extracting the whole
juniper input document text is provided by Data Search 3.2 by means of the
<tt>jsearch</tt> page from the qrserver port. Replace <tt>asearch</tt>
with <tt>jsearch</tt> 
in the URL of a qrserver privileged port search. 

<h2>Using Juniper for proximity boosting with the QRserver</h2>
In order to use Juniper to boost hits that have good proximity of the query
(or to filter the hits based on NEAR or WITHIN constraints) the QRserver
would need to be provided with the following URL addition:
<pre>
rpf_proximitybooster:enabled=1
</pre>
Note that Juniper will return 0 as proximity metric (dynamicteasermetric)
if the query with juniper option constraints cannot be satisfied by
the information in the configured input field. Thus if the selection of a
hit is done solely on the basis of information not present in the Juniper input
(such as the title in the default configuration)
proximity boosting may demote such hits. A solution for this problem has
been proposed for future versions of Juniper.

<h3><a name="qrserver"></a>Supported QRserver options to use with proximity
boosting via Juniper</h3>
QRserver behaviour wrt. proximity boosting can be set both in configuration
(at QRserver startup) or on a per query basis. In the below table, some of the
default configuration settings are listed together with their corresponding
runtime setting, if any. Consult QRserver documentation for the complete
list of options.
<p>
<table>
  <tr bgcolor="#f0f0f0"><td><b>Config parameter
  name</b></td><td><b>Corresponding runtime (URL)
  syntax</a></b></td><td><b>Description</b></td></tr>
  <tr bgcolor="#f0f0f0"><td>rp.proximityboost.enabled=1</td><td>N/A</td><td>Configure for
  proximity boosting in the QRserver (not necessarily enable it)</td></tr>
  <tr
  bgcolor="#f0f0f0"><td>#rp.proximityboost.default</td>
  <td>rpf_proximityboost:enabled=1</td><td>Enable proximity
  boosting</td></tr>
  <tr
  bgcolor="#f0f0f0"><td>rp.proximityboost.factor=0.5</td>
  <td>N/A</td><td>A value that the combined proximity boost value
  calculated possibly from multiple fields, scaled by their individual
  factors are multipled with before adding it to the Fastserver rank value
  to be used to reorder hits.</td></tr>
  <tr bgcolor="#f0f0f0"><td>rp.proximityboost.hits=100</td>
  <td>rpf_proximityboost:hits=100</td><td>The number of Fastserver hits to
  retrieve as basis for the reordering.</td></tr>
  <tr
  bgcolor="#f0f0f0"><td>rp.proximityboost.maxoffset=100</td>
  <td>N/A</td><td>The maximal offset within the list of hits that will be
  subject to any proximity boost reordering/filtering. Hits above this
  range in the original result set will not be subject to proximity boosting.</td></tr>
</table>

<h2>Configuring Juniper within Data Search</h2>
Except where explicitly noted, configuring Juniper for Data Search is
similar to configuring for Real-Time Search. As of Data Search 3.0 Juniper
is by default configured and enabled in Data Search.

<h2>Configuring Juniper within Fast Real-Time Search</h2>
Juniper is provided as part of Real-Time Search (through Fast Search) 
starting with version 2.4. To enable the Fast Search integrated Juniper in
a Real-Time Search environment, see the documentation extensions to
Real-Time Search 2.4. Note that to configure Juniper within Real-Time
Search, the configuration variables should be put in the
<tt>etc/fsearch.addon*</tt> file(s) which will be used as input when Real-Time
Search generates <tt>fsearchrc</tt> files for all configured search
engines. Also a proper <tt>summary.map</tt> file is needed to enable the
dynamic summaries on particular fields.

<h2>Configuring Juniper for Fast Search v.4.15 and higher</h2>
Newer versions of Fast Search provide template support to allow different
Juniper markup depending on the type of display desired (plain,html or
xml). 

All the http frontends that needs an interpretation of the highlight
information provided by Juniper should have the following setup for Juniper:
<p>
<table>
<tr bgcolor="#f0f0f0"><td>juniper.dynsum.highlight_on</td><td>\02</td></tr>
<tr bgcolor="#f0f0f0"><td>juniper.dynsum.highlight_off</td><td>\03</td></tr>
<tr bgcolor="#f0f0f0"><td>juniper.dynsum.continuation</td><td>\1E</td></tr>
</table>
<p>
The actual frontend markup configuration then takes place by setting
variables such as
<p>
<table>
<tr bgcolor="#f0f0f0"><td>tvm.dynsum.html.highlight_on</td></tr>
<tr bgcolor="#f0f0f0"><td>tvm.dynsum.xml.highlight_off</td></tr>
</table>
<p>
in the relevant rc file. 

Note that for a Data Search installation, the QR server has
the responsibility of providing dynamic teasers, consequently the 
<tt>etc/qrserver/qrserverrc</tt> file should provide the above
configuration.
<p>
For a similar Web Search configuration, using a "bare-bone" top level
fdispatch instance, the <tt>fdispatchrc</tt> file is the appropriate place.
<p>
For a "standalone" Real-Time Search setup, the appropriate configuration
file(s) is the fdispatch.addon file(s).

<h2>How to report bugs/errors related to Juniper</h2>
Errors/problems related to Juniper can be divided into two categories:
<ol>
  <li> Problems with specific documents/teasers
  <li> System errors/instability etc.
</ol>
Due to the complexity of a full, running system it is much easier for all
parts if the particular query/document pair triggering the problem can be
identified and analysed off-line.

<h3>Problems with specific teasers</h3>
Problems of this category is likely to occur because there are so many
combinations of queries and documents that it is not possible to
test for all cases. To be able to analyse such problems, it is vital that
the exact (byte-by-byte) teaser generation source (document summary input
to Juniper) can be made available together with the exact query as
presented to Juniper. To determine this requires the following information:
<ol>
   <li> The teaser source docsum. The name of the docsum field is dependent
   on the configuration in summary.map. The data should be provided without
   any post processing performed, if possible, to avoid missing problems
   related to bad input data such as malformet UTF8 characters.
   <li> The original query as submitted by the user
   <li> The expanded query (available under <tt>var/log/querylogs/</tt>)
   <li> A corresponding <tt>fsearchrc</tt> (pure fastserver4)  or
   <tt>fsearch.addon</tt> (Real Time Search/DS 3.x) file used by the
   fsearch process that performed the task.
</ol>


<h3>System errors/instability problems</h3>
Problems of category 2 should, if occurring at all only be associated with
development/beta releases. If such an unfortunate event should happen, the
following information <dfn>in addition to the information associated with
category 1</dfn> would be useful to pin down the problem:
<ol>
  <li> Core file of <dfn>fsearch</dfn> accompanied with the associated
  <dfn>fsearch</dfn> binary.
  <li> Log files from the crashed process (in Data Search these will be
  present as <tt>var/fsearch-*.log</tt> and <tt>var/log/stdout.log</tt>.
</ol>
